Ames Housing in R

smart-R

JADS Foundation - cohort May 2023
Author

Floris Padt

Published

02-06-2023

Modified

09-06-2023

Abstract

In the Ames Housing dataset, which is commonly used for predicting housing prices, there are several
features that can significantly influence the sales price of a house. The importance of these features
can vary depending on the specific dataset and the machine learning algorithm used for analysis.
However, based on general observations and common practices,
the following features are often considered as strong predictors of housing prices:

1. Overall Quality: The overall quality of a house, usually measured on a scale from 1 to 10, is a crucial factor affecting its sales price. Higher-quality homes tend to command higher prices.
2. Above Ground Living Area: The size of the above ground living area, typically measured in square feet, is a strong indicator of a house’s value. Larger houses generally have higher prices.
3. Number of Bedrooms: The number of bedrooms in a house is an important factor for many buyers. Houses with more bedrooms are typically priced higher.
4. Number of Bathrooms: Similarly, the number of bathrooms in a house plays a significant role in determining its value. More bathrooms often lead to higher prices.
5. Lot Size: The size of the lot on which a house is situated can influence its price. Larger lots are generally associated with higher prices, especially in desirable locations.
6. Neighborhood: The neighborhood in which a house is located can have a significant impact on its value.

Code
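library(data.table)  # provides fread()

# github_ames (base URL of the data folder) is defined in a setup chunk not shown here
uri <- paste0(github_ames, "AmesHousing.csv")

df <- read.csv(uri)  # data.frame
dt <- fread(uri)     # data.table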

Introduction

The Ames Housing dataset contains information from the Ames Assessor’s Office used in computing assessed values for individual residential properties sold in Ames, Iowa [IA] from 2006 to 2010.
The dataset has 2,930 observations with 82 variables (23 nominal, 23 ordinal, 14 discrete, and 20 continuous). For a complete description of all included variables, see https://rdrr.io/cran/AmesHousing/man/ames_raw.html.


Exercise 1:

Familiarize yourself with the data.

Provide a table with descriptive statistics for all included variables and check:

  • Classes of each of the variables (e.g. factors or continuous variables).

  • Descriptive/summary statistics for all continuous variables (e.g. mean, SD, range) and factor variables (e.g. frequencies).

  • Explore missing values: sapply(df, function(x) sum(is.na(x)))
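The three checks above can be sketched in base R on a small made-up data frame. The data frame `toy` and its values are illustrative assumptions and are not taken from the Ames data.

```r
# Toy data standing in for the Ames data (hypothetical values)
toy <- data.frame(
  price = c(100, 150, NA, 200),
  style = factor(c("1Story", "2Story", "1Story", NA))
)

# 1. class of each variable
var_class <- sapply(toy, class)

# 2. summary statistics: mean/SD/range for numerics, frequencies for factors
num_stats <- c(
  mean = mean(toy$price, na.rm = TRUE),
  sd   = sd(toy$price, na.rm = TRUE),
  min  = min(toy$price, na.rm = TRUE),
  max  = max(toy$price, na.rm = TRUE)
)
style_freq <- table(toy$style, useNA = "ifany")

# 3. number of missing values per variable
n_missing <- sapply(toy, function(x) sum(is.na(x)))

print(var_class)
print(num_stats)
print(style_freq)
print(n_missing)
```

The same calls work unchanged on `df`; only the column names differ.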

Data Set

Code
dt %>%
setcolorder(c("Order", "SalePrice")) %>%
DT::datatable(
  caption = "Table 1: Ames Housing dataset",
  class = "compact stripe",
  rownames = FALSE,
  filter = 'top',
  extensions = c('FixedColumns'),
  options = list(
    scrollX      = TRUE,
    fixedColumns = list(leftColumns = 2)
    )
 )  %>% 
  formatCurrency("SalePrice", '\U0024', digits = 0) %>%
  formatStyle(
    'SalePrice',
    color              = "#003700",
    fontWeight         = "bold",
    backgroundColor    = '#FFFFF0',     
    backgroundSize     = '100% 60%',
    backgroundRepeat   = 'no-repeat',
    backgroundPosition = 'center'
  ) %>%
  formatStyle(
    'Order',  
    color              = '#C0C0C0', 
    backgroundColor    = '#FFFFF0'
  )
Note
  • Use the base-R function str (no package needed)
  • Use the describe function (from the psych-package) for continuous variables
  • Use the table function (base-R) for factor variables.
Code
# To check the structure of the data, you can use the "str"-command:
# str(dt)

# create a table with the type of the data
dt_str <-
  dt[, lapply(.SD, typeof)]               %>% 
  melt.data.table(
    measure.vars    = names(.),
    variable.factor = FALSE)              %>%
  setorder(value, variable )             

# display a summary per type
dt_str %>%
  .[, .(count = .N), by = value] %>%
  DT::datatable(
    caption = "Table 2: Data structure summary",
    class = "compact stripe",
    rownames = FALSE,
    options = list(
      dom = "t"
    )
  ) %>%
  formatStyle(
    "value",
    color              = "#370037",
    backgroundColor    = "#FFFFF0",
    fontWeight         = "bold"
  )
Code
# display structure/type of the data  
dt_str %>%
  DT::datatable(
    caption = "Table 3: Data structure and types",
    class = "compact stripe",
    rownames = FALSE,
    filter = "top"
  ) %>%
    formatStyle(
      "variable",
      color              = "#370037",
      backgroundColor    = "#FFFFF0",
      fontWeight         = "bold"
    )
Code
dt_chr <- dt_str[value == "character", variable]
dt_int <- dt_str[value == "integer", variable]

After import, all categorical variables have the ‘character’ class.
The following code converts each character variable into a factor variable:

Code
df[sapply(df, is.character)] <- lapply(df[sapply(df, is.character)], as.factor)
# str(df)

# convert character variables to factor variables
chr2fct <- function(x){
    if(is.character(x)) 
      as.factor(x) 
    else 
      x
  }

# convert character variables to factor variables
# keep the integers
dt[, names(dt):= lapply(.SD, chr2fct)]

# display the factors and levels
str(dt[, ..dt_chr])
Classes 'data.table' and 'data.frame':  2930 obs. of  43 variables:
 $ Alley         : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
 $ Bldg Type     : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 5 5 5 1 ...
 $ Bsmt Cond     : Factor w/ 6 levels "","Ex","Fa","Gd",..: 4 6 6 6 6 6 6 6 6 6 ...
 $ Bsmt Exposure : Factor w/ 5 levels "","Av","Gd","Mn",..: 3 5 5 5 5 5 4 5 5 5 ...
 $ Bsmt Qual     : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 4 6 4 4 4 6 ...
 $ BsmtFin Type 1: Factor w/ 7 levels "","ALQ","BLQ",..: 3 6 2 2 4 4 4 2 4 7 ...
 $ BsmtFin Type 2: Factor w/ 7 levels "","ALQ","BLQ",..: 7 5 7 7 7 7 7 7 7 7 ...
 $ Central Air   : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
 $ Condition 1   : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 3 3 3 ...
 $ Condition 2   : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ Electrical    : Factor w/ 6 levels "","FuseA","FuseF",..: 6 6 6 6 6 6 6 6 6 6 ...
 $ Exter Cond    : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ Exter Qual    : Factor w/ 4 levels "Ex","Fa","Gd",..: 4 4 4 3 4 4 3 3 3 4 ...
 $ Exterior 1st  : Factor w/ 16 levels "AsbShng","AsphShn",..: 4 14 15 4 14 14 6 7 6 14 ...
 $ Exterior 2nd  : Factor w/ 17 levels "AsbShng","AsphShn",..: 11 15 16 4 15 15 6 7 6 15 ...
 $ Fence         : Factor w/ 4 levels "GdPrv","GdWo",..: NA 3 NA NA 3 NA NA NA NA NA ...
 $ Fireplace Qu  : Factor w/ 5 levels "Ex","Fa","Gd",..: 3 NA NA 5 5 3 NA NA 5 5 ...
 $ Foundation    : Factor w/ 6 levels "BrkTil","CBlock",..: 2 2 2 2 3 3 3 3 3 3 ...
 $ Functional    : Factor w/ 8 levels "Maj1","Maj2",..: 8 8 8 8 8 8 8 8 8 8 ...
 $ Garage Cond   : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 6 6 6 6 6 ...
 $ Garage Finish : Factor w/ 4 levels "","Fin","RFn",..: 2 4 4 2 2 2 2 3 3 2 ...
 $ Garage Qual   : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 6 6 6 6 6 ...
 $ Garage Type   : Factor w/ 6 levels "2Types","Attchd",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Heating       : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Heating QC    : Factor w/ 5 levels "Ex","Fa","Gd",..: 2 5 5 1 3 1 1 1 1 3 ...
 $ House Style   : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 3 3 3 3 6 6 3 3 3 6 ...
 $ Kitchen Qual  : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 3 1 5 3 3 3 3 3 ...
 $ Land Contour  : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 2 4 4 ...
 $ Land Slope    : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
 $ Lot Config    : Factor w/ 5 levels "Corner","CulDSac",..: 1 5 1 1 5 5 5 5 5 5 ...
 $ Lot Shape     : Factor w/ 4 levels "IR1","IR2","IR3",..: 1 4 1 4 1 1 4 1 1 4 ...
 $ MS Zoning     : Factor w/ 7 levels "A (agr)","C (all)",..: 6 5 6 6 6 6 6 6 6 6 ...
 $ Mas Vnr Type  : Factor w/ 6 levels "","BrkCmn","BrkFace",..: 6 5 3 5 5 3 5 5 5 5 ...
 $ Misc Feature  : Factor w/ 5 levels "Elev","Gar2",..: NA NA 2 NA NA NA NA NA NA NA ...
 $ Neighborhood  : Factor w/ 28 levels "Blmngtn","Blueste",..: 16 16 16 16 9 9 25 25 25 9 ...
 $ Paved Drive   : Factor w/ 3 levels "N","P","Y": 2 3 3 3 3 3 3 3 3 3 ...
 $ Pool QC       : Factor w/ 4 levels "Ex","Fa","Gd",..: NA NA NA NA NA NA NA NA NA NA ...
 $ Roof Matl     : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Roof Style    : Factor w/ 6 levels "Flat","Gable",..: 4 2 4 4 2 2 2 2 2 2 ...
 $ Sale Condition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ Sale Type     : Factor w/ 10 levels "COD","Con","ConLD",..: 10 10 10 10 10 10 10 10 10 10 ...
 $ Street        : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
 $ Utilities     : Factor w/ 3 levels "AllPub","NoSeWa",..: 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, ".internal.selfref")=<externalptr> 

Explore missing values

Create a table with the number of missing values per variable.

Code
# sapply(df, function(x) sum(is.na(x))) 

# table of missing values per variable
f_kbl_with_NA(dt) %>%
  kable_styling(
    full_width      = FALSE, 
    position        = "left",
    htmltable_class = "lighttable-hover lighttable-condensed lightable-striped"
    ) 
variable value
Pool QC 2917
Misc Feature 2824
Alley 2732
Fence 2358
Fireplace Qu 1422
Lot Frontage 490
Garage Yr Blt 159
Garage Qual 158
Garage Cond 158
Garage Type 157
Garage Finish 157
Bsmt Qual 79
Bsmt Cond 79
Bsmt Exposure 79
BsmtFin Type 1 79
BsmtFin Type 2 79
Mas Vnr Area 23
Bsmt Full Bath 2
Bsmt Half Bath 2
BsmtFin SF 1 1
BsmtFin SF 2 1
Bsmt Unf SF 1
Total Bsmt SF 1
Garage Cars 1
Garage Area 1
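The helper f_kbl_with_NA used above is defined elsewhere in the project. A minimal base-R stand-in might look like the sketch below, assuming the helper counts the NA values per column, keeps the columns with at least one NA, and sorts them in descending order. The name `count_na` and the toy data are hypothetical.

```r
# Hypothetical stand-in for f_kbl_with_NA(): count NA per column,
# keep columns with at least one NA, sort descending
count_na <- function(data) {
  n_na <- sapply(data, function(x) sum(is.na(x)))
  n_na <- sort(n_na[n_na > 0], decreasing = TRUE)
  data.frame(variable = names(n_na), value = unname(n_na))
}

# toy data (made-up values)
toy <- data.frame(
  a = c(1, NA, NA),
  b = c("x", "y", NA),
  c = 1:3
)

na_tbl <- count_na(toy)
print(na_tbl)
```

The result is a plain data.frame, so it can be passed to kbl() or DT::datatable() for display.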

Descriptive statistics

Create a table with descriptive statistics for all included variables.
For continuous variables, you can use the describe function (from the psych-package).
For factor variables, you can use the table function (base-R).

Code
# call describe() outside dt[...] so that the row names (variable names) are kept
psych::describe(dt[, ..dt_int]) %>%
  as.data.table(keep.rownames = "cont_vars") %>%
  DT::datatable(
    caption  = "Table 4: Describe numerics",
    class    = "stripe",
    rownames = FALSE,
    filter   = "top",
    extensions = c('FixedColumns'),
    options = list(
      scrollX      = TRUE,
      fixedColumns = list(leftColumns = 1)
    )
  ) %>%
  formatStyle(
    "cont_vars",
    color              = "#370037",
    backgroundColor    = "#FFFFF0",
    fontWeight         = "bold"
  )
Code
# count the occurrences of each value in a vector
my_cnt <- function(x) {
  data.table(col = x) %>%
    .[, .(cnt = .N), by = col]
}

# make sure all remaining character variables are factors
chr_cols <- names(dt)[sapply(dt, is.character)]
if (length(chr_cols) > 0) {
  dt[, (chr_cols) := lapply(.SD, as.factor), .SDcols = chr_cols]
}

# select the factor columns and reshape them into long format
# (as.character avoids combining factors with different levels)
fct_cols <- names(dt)[sapply(dt, is.factor)]
dt_long  <- melt(
  dt[, lapply(.SD, as.character), .SDcols = fct_cols],
  measure.vars  = fct_cols,
  variable.name = "Column"
)

# Create bar chart for each column
ggplot(dt_long, aes(x = fct_infreq(value))) +
  geom_bar() +
  facet_wrap(~Column, scales = "free_x") +
  labs(x = "Value", y = "Count") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

# frequency table per factor variable (data.table)
tst <- lapply(dt[, ..fct_cols], my_cnt)

# frequency table per factor variable (data.frame, base R)
temp <-
  df %>%
  purrr::keep(is.factor)

for (i in seq_len(ncol(temp))) {
  print(names(temp)[i])
  print(table(temp[, i]))
}

Exercise 2:

There are several missing values in the dataset, which need to be handled before we can proceed with the rest of the analysis.

There are many ways to impute missing values, but for now, impute missing values in numeric variables with the median, and impute missing values in all factor variables with the label “100”.
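The imputation rule can be illustrated on a small made-up data frame before applying it to the full data. This is a sketch under the assumption that the median is taken over the observed values only; the column names and values are hypothetical.

```r
# toy data (made-up values)
toy <- data.frame(
  area  = c(10, NA, 30, 100),
  fence = factor(c("GdPrv", NA, "MnPrv", "GdPrv"))
)

# numeric: replace NA by the median of the observed values
toy$area[is.na(toy$area)] <- median(toy$area, na.rm = TRUE)

# factor: go through character first; calling ifelse() directly on a factor
# would return the internal integer codes instead of the labels
fence_chr <- as.character(toy$fence)
fence_chr[is.na(fence_chr)] <- "100"
toy$fence <- factor(fence_chr)

print(toy)
```

Converting back with factor() adds "100" as a regular level, so later models treat "missing" as a category of its own.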

Imputation of missing values for numeric variables

Code
# impute NA with median for all numeric variables
dt[, (dt_int) := lapply(.SD, function(x){
  ifelse(is.na(x), median(x, na.rm=T), x)}), .SDcols = dt_int]

# table of missing values per variable
f_kbl_with_NA(dt)
variable value
Pool QC 2917
Misc Feature 2824
Alley 2732
Fence 2358
Fireplace Qu 1422
Garage Qual 158
Garage Cond 158
Garage Type 157
Garage Finish 157
Bsmt Qual 79
Bsmt Cond 79
Bsmt Exposure 79
BsmtFin Type 1 79
BsmtFin Type 2 79
Code
df <-
  lapply(df, function(x) {
    ### Impute median for all missing numeric values
    if(is.numeric(x)) ifelse(is.na(x), median(x, na.rm=T), x) else x
  }
  ) %>%
  data.frame()

Imputation of missing values for factor variables

Code
# generate a vector with variable names for all factor variables
factor_variables <- 
  df              %>%
  keep(is.factor) %>% 
  names

# impute missing values for factor variables
df <-
  lapply(df, function(x) {
    # use as.character(): ifelse() on a factor would return its integer codes
    if (is.factor(x)) ifelse(is.na(x), "100", as.character(x)) else x
  }) %>%
  data.frame()

# 100 imputation for factor variables
dt[, (dt_chr) := lapply(.SD, function(x) {
  ifelse(is.na(x), "100", as.character(x))
}), .SDcols = dt_chr]

# convert the imputed variables back to factor variables
# (imputation turned them into character variables)
df[factor_variables] <- lapply(df[factor_variables], factor)
dt[, (dt_chr) := lapply(.SD, as.factor), .SDcols = dt_chr]

check for missing values

Code
# sapply(df, function(x) sum(is.na(x)))

# table of missing values per variable
f_kbl_with_NA(dt)
variable value

check for blank values

before imputation

Code
# table of blank values per variable
dt[, lapply(.SD, function(x) sum(trimws(x) == '', na.rm = TRUE))] %>%
    melt.data.table(measure.vars = names(.))     %>%
    .[value > 0] %>%
    setorder(-value) %>%
    kbl(
      align = "l"
    )
variable value
Mas Vnr Type 23
Bsmt Exposure 4
BsmtFin Type 2 2
Garage Finish 2
Bsmt Qual 1
Bsmt Cond 1
BsmtFin Type 1 1
Electrical 1
Garage Qual 1
Garage Cond 1

impute ‘100’

Code
# 100 imputation for factor variables
dt[, (dt_chr) := lapply(.SD, function(x) {
  ifelse(x == '', "100", as.character(x))
}), .SDcols = dt_chr]

# convert the imputed variables back to factor variables
dt[, (dt_chr) := lapply(.SD, as.factor), .SDcols = dt_chr]

after imputation

Code
# table of blank values per variable
dt[, lapply(.SD, function(x) sum(trimws(x) == '', na.rm = TRUE))] %>%
    melt.data.table(measure.vars = names(.))     %>%
    .[value > 0] %>%
    setorder(-value) %>%
    kbl(
      align = "l"
    )
variable value

list the variables and their values

Code
dtVV <- 
  dt[, ..dt_chr] %>%
  melt.data.table(measure.vars = dt_chr) %>%
  unique() 

dtVV %>%
  DT::datatable(
    caption    = "Table 4a: Variable Values",
    class      = "stripe",
    rownames   = FALSE,
    filter     = "top",
    extensions = c("FixedColumns"),
    options = list(
      scrollX      = TRUE,
      fixedColumns = list(leftColumns = 1)
    )
  ) %>%
  formatStyle(
    "variable",
    color              = "#370037",
    backgroundColor    = "#FFFFF0",
    fontWeight         = "bold"
  )

Exercise 4 - Explore the outcome variable (SalePrice) and how it correlates to other features

The variable “SalePrice” refers to the price at which a property was sold and hence is the variable of interest for our prediction model (“Y” or dependent variable).

Please explore Y in terms of:

  1. Compute descriptive/summary statistics for the Y variable (mean, median, SD, range)
  2. Investigate how neighborhood (categorical) and above-ground living area (continuous) relate to the Y variable; use, e.g., bar charts, scatter plots, boxplots
  3. Visualize the distribution of the Y variable. What do you observe?
  4. Assess the distribution of SalePrice from the previous step. What do you observe? Log-transform the outcome variable. What does this mean for the performance of the prediction model?
  5. Assess the above-ground living area (‘Gr Liv Area’) for all houses. What do you observe? Remove outliers. What does this mean for the applicability of the prediction model?
  6. Draw scatter plots between Y and all numerical features
  7. Draw correlation plots to see all correlations between Y and the independent (continuous) variables (Hint: calculate the Pearson correlation coefficient)
  • Visualize the distribution of Y (e.g. use base-R “hist” or “ggplot” from the “ggplot2”-package)

  • Visualize the distribution of Y by looking at various subgroups
    (e.g. create boxplot or scatterplot using the “ggplot2”-package).

  • Look at differences between neighborhoods.

  • Look at differences between housing style.

  • Draw a correlation plot to see all correlations between Y and the independent (numeric) variables.
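Items 4 and 5 ask for a log-transformation and for outlier removal. The sketch below shows both steps on simulated data, not on the Ames data: the log-normal prices, the helper `skewness`, and the cut-off of 4000 are all illustrative assumptions.

```r
set.seed(1)

# simulated right-skewed prices (log-normal); NOT the Ames data
price <- exp(rnorm(1000, mean = 12, sd = 0.4))

# simple moment-based skewness (hypothetical helper)
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# compare skewness before and after the log-transformation
skew_raw <- skewness(price)
skew_log <- skewness(log(price))
print(c(raw = skew_raw, log = skew_log))

# outlier removal on a simulated living-area variable:
# 995 ordinary values plus 5 very large ones appended by hand
area <- c(rnorm(995, mean = 1500, sd = 400), 4500, 4700, 5000, 5100, 5600)

# keep only observations up to the chosen cut-off (4000 is an assumption)
area_trimmed <- area[area <= 4000]
n_removed    <- length(area) - length(area_trimmed)
print(n_removed)
```

A model trained after such trimming has never seen very large houses, so its predictions for them should be treated with caution.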

Note

For visualization, ggplot is frequently used as it provides a flexible way to draw a lot of different graphs.

ggplot contains two basic elements:

  1. The initiation command:
    ggplot(DATASET, aes(x=XVAR, y=YVAR, group=XVAR))
    This draws a blank ggplot. Even though the x and y are specified, there are no points or lines in it.

  2. Add the geom of interest; for this exercise you’ll need:
    + geom_point() (for a scatterplot) or
    + geom_boxplot() (for a boxplot)

The full code to write a scatter plot would then be:

ggplot(DATASET, aes(x=XVAR, y=YVAR)) + geom_point()

Note

To draw a correlation plot, please use the “corrplot”-package.
With this package, a correlation plot can be constructed in two steps:

  1. Use “cor” to calculate correlation between all combinations of numeric variables
    select numeric variables by using: df %>% keep(is.numeric)

  2. Plot the calculated correlation by using the corrplot -function
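The two steps can be tried out on R's built-in mtcars data before running them on the Ames data. In this sketch mpg plays the role of SalePrice; the corrplot call is left as a comment because it needs the corrplot package to be installed.

```r
# mtcars (built into R) stands in for the Ames data here
num_df <- mtcars[sapply(mtcars, is.numeric)]

# Step 1: Pearson correlation between all pairs of numeric variables
corr_m <- cor(num_df, method = "pearson")

# correlations with the outcome (mpg plays the role of SalePrice), sorted
corr_y <- sort(corr_m[, "mpg"], decreasing = TRUE)
print(round(corr_y, 2))

# Step 2 (requires the corrplot package):
# corrplot::corrplot(corr_m, type = "upper", diag = FALSE)
```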

Descriptive/summary statistics

Code
# Descriptive/summary statistics (e.g. mean, SDs, range)

dt$SalePrice                   %>%
  psych::describe()            %>%
  t()                          %>%
  as.data.table(
    keep.rownames = "stat")    %>%
  .[, .(stat, 
        SalesPrice = X1)]      %>%
  kbl(
    digits      = 0,
    caption     = "Table 5: Descriptive statistics for Sales Price",
    format.args = list(big.mark = ","),
    align       = 'l'
  ) %>%
  kable_styling(
    full_width      = FALSE, 
    position        = "left",
    htmltable_class = "lighttable-hover lighttable-condensed lightable-striped") 
Table 5: Descriptive statistics for Sales Price
stat SalesPrice
vars 1
n 2,930
mean 180,796
sd 79,887
median 160,000
trimmed 170,429
mad 54,856
min 12,789
max 755,000
range 742,211
skew 2
kurtosis 5
se 1,476

Visualize the distribution of Y

Code
# Visualize the distribution of Y 
# (e.g. use base-R "hist" or "ggplot" from the "ggplot2"-package)
hist(dt$SalePrice)

Code
ggplot(data = dt, aes(SalePrice)) + 
  geom_histogram(fill = "#005100", color = "#FFFFF0", bins = 18) + 
  # scale_x_continuous(limits = c(0,600000), expand = c(0, 0)) +
  # scale_y_continuous(limits = c(0,650)   , expand = c(0, 0)) +
  labs(title = "Histogram of Sale Price") +
  ylab(label = "Count") + 
  xlab(label = "Sale Price") +
  # theme_classic() +
  theme(
    axis.title.x = element_text(
      colour = "#370037", size = 11.5, face = "bold"), 
    axis.title.y = element_text(
      colour = "#370037", size = 11.5, face = "bold"),
    plot.title = element_text(
      colour = "#370037", size = 18  , face = "bold", hjust = 0)
  ) 

Visualize Y by Lot Area and Neighbourhood

Code
# Visualize the distribution of Y by looking at various subgroups 
# (e.g. create boxplot or scatterplot using the "ggplot2"-package)

# Scatterplot
p1 <-
  ggplot(data = dt, aes(x = `Lot Area`, y = SalePrice)) + 
    geom_point(size = .7, color = "#005100") +
    scale_x_continuous(limits = c(0, 50000) , expand = c(0, 0)) +
    scale_y_continuous(limits = c(0, 600000), expand = c(0, 0)) +
    labs(title = "Scatterplot Sale Price by Lot Area") +
    ylab(label = "Sale Price") + 
    xlab(label = "Lot area") +
    # theme_classic() +
    theme(
      axis.title.x = element_text(
        colour = "#370037", size = 11.5, face = "bold"), 
      axis.title.y = element_text(
        colour = "#370037", size = 11.5, face = "bold"), 
      plot.title = element_text(
        colour = "#370037", size = 18  , face = "bold", hjust = 0))

# Side-by-side plots, only 1 
grid.arrange(p1, nrow = 1)
Warning: Removed 20 rows containing missing values (`geom_point()`).

Code
# Boxplot
dt[, avgSP := mean(SalePrice), by = Neighborhood]       %>%
  .[, Neighborhood := fct_reorder(Neighborhood, avgSP)] %>%
  .[, avgSP := NULL]                                    %>%
  ggplot(aes(x = Neighborhood, y = SalePrice)) + 
    geom_boxplot(color = "#005100", fill = "#FFFFF0") +
    labs(title = "Boxplot Sale Price by Neighbourhood") +
    ylab(label = "Sale Price") + 
    xlab(label = "Neighbourhood") +
    # theme_classic() +
    theme(
      axis.title.x = element_text(
        colour = "#370037", size = 11.5, face = "bold"), 
      axis.title.y = element_text(
        colour = "#370037", size = 11.5, face = "bold"), 
      plot.title = element_text(
        colour = "#370037", size = 18  , face = "bold", hjust = 0),
      axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)
    )

Visualize Y by House Style

Note

Box-plots are sorted by the mean of the dependent variable (SalePrice).  The mean of the dependent variable is calculated for each level of the independent variable (House Style).
The levels of the independent variable are reordered based on the mean of the dependent variable.
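The reordering described in this note can be sketched without forcats, using base R's reorder(). The data frame below is made up for illustration.

```r
# toy data (made-up values): three styles with different mean prices
toy <- data.frame(
  style = factor(c("A", "A", "B", "B", "C", "C")),
  price = c(300, 320, 100, 120, 200, 220)
)

# default level order is alphabetical
levels_before <- levels(toy$style)

# reorder the levels by the mean price within each level (ascending)
toy$style    <- reorder(toy$style, toy$price, FUN = mean)
levels_after <- levels(toy$style)

print(levels_before)
print(levels_after)

# a boxplot would now show the groups from lowest to highest mean price:
# boxplot(price ~ style, data = toy)
```

Only the level order changes; the values stored in the column stay the same.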

Code
#|label: Look at differences between housing style
dt[, avgHS := mean(SalePrice), by = `House Style`]        %>%
  .[, `House Style` := fct_reorder(`House Style`, avgHS)] %>%
  .[, avgHS := NULL]                                    %>%  
  ggplot(aes(x = `House Style`, y = SalePrice)) + 
    geom_boxplot(color = "#005100", fill = "#FFFFF0") +
    labs(title = "Boxplot Sale Price by House Style") +
    ylab(label = "Sale Price") + 
    xlab(label = "House Style") +
    # theme_classic() +
    theme(
      axis.title.x = element_text(
        colour = "#370037", size = 11.5, face = "bold"), 
      axis.title.y = element_text(
        colour = "#370037", size = 11.5, face = "bold"), 
      plot.title = element_text(
        colour = "#370037", size = 18  , face = "bold", hjust = 0),
      axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)
    )

Correlation plot

Code
# corr_df <- 
#   df               %>% 
#   keep(is.numeric) %>% 
#   cor

 corr_dt <-
  dt[, ..dt_int]   %>% 
  cor(
    use    = "everything", 
    method = "pearson"
  ) 

corrplot(
  corr          = corr_dt, 
  type          = "upper",
  title         = "Correlation between all numeric variables in the dataset", 
  diag          = FALSE,
  order         = 'hclust',
  hclust.method = 'median',
  addrect       = 3,
  number.font   = 2, 
  tl.cex        = 0.50,
  mar           = c(0, 0, 1, 0)
)

Code
corr_dt[, "SalePrice"]      %>%
  as.data.table(
    keep.rownames = "var",
    check.names   = FALSE
  )                         %>%
  setnames(".", "corr")     %>%
  setorder(-corr)           %>%
  kbl()
var corr
SalePrice 1.0000000
Overall Qual 0.7992618
Gr Liv Area 0.7067799
Garage Cars 0.6478115
Garage Area 0.6403811
Total Bsmt SF 0.6321639
1st Flr SF 0.6216761
Year Built 0.5584261
Full Bath 0.5456039
Year Remod/Add 0.5329738
Garage Yr Blt 0.5088825
Mas Vnr Area 0.5021960
TotRms AbvGrd 0.4954744
Fireplaces 0.4745581
BsmtFin SF 1 0.4328618
Lot Frontage 0.3402558
Wood Deck SF 0.3271432
Open Porch SF 0.3129505
Half Bath 0.2850560
Bsmt Full Bath 0.2758227
2nd Flr SF 0.2693734
Lot Area 0.2665492
Bsmt Unf SF 0.1828955
Bedroom AbvGr 0.1439134
Screen Porch 0.1121512
Pool Area 0.0684032
Mo Sold 0.0352588
3Ssn Porch 0.0322246
BsmtFin SF 2 0.0060176
Misc Val -0.0156915
Yr Sold -0.0305691
Order -0.0314079
Bsmt Half Bath -0.0358166
Low Qual Fin SF -0.0376598
MS SubClass -0.0850916
Overall Cond -0.1016969
Kitchen AbvGr -0.1198137
Enclosed Porch -0.1287874
PID -0.2465212

Exercise 5 - Estimate a Linear Regression, a LASSO and a kNN model

Now that we have a better feeling of the information in the data set and we took care of the missing values, we can start by running some (additional) simple machine learning models.

We will use the “caret”-package for this exercise. Split the data randomly into a train set (70%) and test set (30%)

train-test split

Code
set.seed(1234)

# use the caret::createDataPartition function to split the data
Index <- 
  createDataPartition(dt$Order, p = 0.7, list = FALSE)

train <- dt[ Index, ]
test  <- dt[-Index, ]

Cross-validation

Next, we need to specify how to perform the cross-validation (i.e. the optimization of the model on the train set). To this end, we set the CV method, the number of folds, and the number of times the process is repeated. We will use the “repeatedcv” method, with 10 folds and 3 repeats.

Code
# Cross-validation strategy from the caret package
ctrl <- 
  trainControl(
    method  = "repeatedcv",
    number  = 5,   # five folds (set number = 10 for the ten folds described above)
    repeats = 3)   # repeated three times
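To make the idea of repeated k-fold cross-validation concrete, the base-R sketch below builds the hold-out folds by hand for a toy problem. It is an illustration only: caret creates its folds internally and may differ in detail (for example by stratifying on the outcome), and the values of n, k and repeats are assumptions.

```r
set.seed(42)

n       <- 20   # number of training rows (toy value)
k       <- 5    # folds
repeats <- 3    # repetitions

# for each repeat: shuffle the row indices and deal them into k folds
folds <- lapply(seq_len(repeats), function(r) {
  split(sample(n), rep(seq_len(k), length.out = n))
})

# every repeat yields k hold-out sets that together cover all rows once
print(length(folds))          # number of repeats
print(lengths(folds[[1]]))    # size of each hold-out fold in repeat 1
print(k * repeats)            # model fits per candidate tuning value
```

Each model is fitted on the rows outside a fold and evaluated on the rows inside it; the performance is then averaged over all k * repeats fits.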

Modeling

  1. Estimate a Linear Regression model
  2. Estimate a LASSO model
  3. Estimate a kNN model

Linear Regression model

Code
# Scatterplot with smoother lm

copy(dt[, ..dt_int]) %>%
  melt.data.table(
    id.vars      = c("Order", "SalePrice")
  ) %>%
  ggplot(
    aes(x = value     , y = SalePrice)) + 
  geom_point(size = .7, color = "#005100") +
  geom_smooth(
    method = "lm",
    se     = FALSE, 
    color  = "#0000FF",
    lwd    = 2 ) +
  facet_wrap(
    ncol   = 4, 
    facets = ~ variable, 
    scales = "free") +
  # coord_cartesian(
  #   xlim = c(0,  50000),
  #   ylim = c(0, 600000)) +
  labs(title = "Scatterplot Sale Price by Variable") +
  ylab(label = "Sale Price") + 
  # xlab(label = "Lot area") +
  # theme_classic() +
  theme(
    axis.title.x = element_text(
      colour = "#370037", size = 11.5, face = "bold"), 
    axis.title.y = element_text(
      colour = "#370037", size = 11.5, face = "bold"), 
    plot.title = element_text(
      colour = "#370037", size = 18  , face = "bold", hjust = 0))

Code
# Side-by-side plots
# grid.arrange(p2, nrow = 1)

Calculate how well the model explains the variance in the data (R²).

Code
# Fit the linear regression model on the training data
model <- lm(SalePrice ~ ., data = train)

# View the summary of the model
sum_mod <- summary(model) 
paste(
  "Multiple R-squared:", round(sum_mod$r.squared    , 3),   
  "Adjusted R-squared:", round(sum_mod$adj.r.squared, 3)
)
[1] "Multiple R-squared: 0.949 Adjusted R-squared: 0.942"
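The two numbers reported by summary() can be reproduced by hand, which shows what they measure. The sketch below uses the built-in mtcars data as a stand-in for the training set; the choice of predictors is arbitrary.

```r
# mtcars stands in for the training data
fit <- lm(mpg ~ wt + hp, data = mtcars)

y      <- mtcars$mpg
ss_res <- sum(residuals(fit)^2)     # variation left unexplained by the model
ss_tot <- sum((y - mean(y))^2)      # total variation around the mean
r2     <- 1 - ss_res / ss_tot

# adjusted R-squared penalises the number of predictors p
n      <- nrow(mtcars)
p      <- length(coef(fit)) - 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(c(r2 = r2, adj_r2 = adj_r2))
```

Both values are computed on the data the model was fitted on, so they say nothing yet about performance on the test set.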
Code
# Extract the coefficients
coefficients <- coef(model)

# Extract the p-values for each coefficient
p_values <- summary(model)$coefficients[, "Pr(>|t|)"]

# Coefficients: (6 not defined because of singularities)
setdiff(names(coefficients), names(p_values))
[1] "`Mas Vnr Type`CBlock" "`Bsmt Cond`TA"        "`BsmtFin Type 1`Unf" 
[4] "`Gr Liv Area`"        "`Garage Qual`TA"      "`Garage Cond`TA"     
Code
# create table with the coefficients and their importance measures
data.table(
  Variable    = names(p_values),
  Coefficient = coefficients[names(p_values)],
  P_Value     = p_values,
  Importance  = abs(coefficients[names(p_values)]) / sum(abs(coefficients[names(p_values)]))
) %>%
.[P_Value < 0.05]     %>%
setorder(-P_Value)    %>%
DT::datatable(
  caption    = "Table 6: Linear Regression model",
  class      = "compact stripe",
  rownames   = FALSE,
  filter     = 'top',
  extensions = c('FixedColumns'),
  options = list(
    scrollX      = TRUE,
    fixedColumns = list(leftColumns = 1)
    )
 )  %>% 
  formatStyle(
    'Variable',
    color              = "#003700",
    fontWeight         = "bold",
    backgroundColor    = '#FFFFF0'     
  ) 

What does it mean when in a linear regression model you have singularities and how to solve this?

A singularity in a linear regression model means that one or more of the independent variables can be expressed as a linear combination of the other independent variables.

This is a problem because it means that the model cannot distinguish between the effects of the variables that are linearly dependent.

To address singularities caused by multicollinearity, you can take the following steps:

  1. Identify the variables causing multicollinearity: look for high pairwise correlations or examine variance inflation factors (VIF) to identify the variables that contribute to multicollinearity.

  2. Resolve multicollinearity:
    • Remove one or more of the highly correlated variables.
    • Combine correlated variables to create new composite variables.
    • Use dimensionality reduction techniques like principal component analysis (PCA).

  3. Assess the impact: re-estimate the model after resolving multicollinearity and examine the changes in coefficients, standard errors, and significance levels.

— chatGPT
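A singularity is easy to reproduce on simulated data. In the sketch below the column `total` is an exact sum of two other predictors, so lm() cannot estimate its coefficient and reports NA. All names and values are made up. In the output above `Gr Liv Area` is among the undefined coefficients, which would be consistent with it being the sum of other floor-area columns; that is an assumption about the data and is not verified here.

```r
set.seed(1)

# simulated floor areas; total is an exact sum of the other two columns
first  <- runif(50, 500, 1500)
second <- runif(50, 0, 1000)
total  <- first + second
price  <- 100 * first + 80 * second + rnorm(50, sd = 1000)

# the redundant column gets an NA coefficient
fit <- lm(price ~ first + second + total)
print(coef(fit))

# dropping it gives a model in which every coefficient is estimable
fit2 <- lm(price ~ first + second)
print(coef(fit2))
```

The fitted values of both models are the same; only the redundant term disappears.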

Can I use AIC to determine which variables I need to use in my linear regression model?

Yes, you can use the Akaike Information Criterion (AIC) to determine which variables to include in your linear regression model. The AIC is a metric that balances the goodness of fit of a model with its complexity, penalizing models with more parameters.

The general idea is to compare the AIC values of different models with different sets of variables and select the model with the lowest AIC as the preferred model.

We initially fit a model using all potential variables. Then, we iterate over each variable and fit models without each variable, calculating the AIC for each reduced model. The variables whose removal increases the AIC the most are considered the most informative and are selected for the final model.

— chatGPT
Code
# number of variables to keep in the final model
k <- 10

# Use the lm model fitted above
initial_model <- model

# Calculate AIC for the initial (full) model
initial_aic <- AIC(initial_model)

# Initialize a list to store the AIC of each reduced model
aic_values <- list()

# work on a copy of the training data with spaces in the names replaced
train_dt <- copy(train)
setnames(train_dt, gsub(" ", "_", names(train_dt)))

# Iterate over each variable to evaluate its contribution to the model
for (var in names(train_dt)) {
  # Skip the dependent variable
  if (var == "SalePrice")
    next
  
  # Fit a model without the current variable; the backticks keep
  # non-syntactic names such as 1st_Flr_SF or Year_Remod/Add valid
  reduced_model <- lm(as.formula(paste0("SalePrice ~ . - `", var, "`")), data = train_dt)
  
  # Calculate AIC for the reduced model
  aic <- AIC(reduced_model)
  
  # Store the AIC value in the list
  aic_values[[var]] <- aic
}

# Sort the AIC values in descending order: the higher the AIC after
# dropping a variable, the more that variable contributes to the fit
sorted_aic <- sort(unlist(aic_values), decreasing = TRUE)

# Identify the k variables whose removal hurts the model most
selected_vars <- names(sorted_aic)[1:k]

# Build the final model using the selected variables
final_model <- 
  lm(SalePrice ~ ., data = train_dt[, c("SalePrice", selected_vars), with = FALSE])
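As an alternative to a hand-written loop, base R's step() function performs AIC-based stepwise selection. The sketch below runs a backward elimination on the built-in mtcars data as a stand-in for the training set; it is a different technique from the single-pass ranking above, because variables are dropped one at a time and the AIC is re-evaluated after every removal.

```r
# mtcars stands in for the training data
full_model <- lm(mpg ~ ., data = mtcars)

# backward elimination: repeatedly drop the term whose removal lowers AIC most,
# and stop when no removal lowers the AIC any further
step_model <- step(full_model, direction = "backward", trace = 0)

print(formula(step_model))
print(c(full = AIC(full_model), selected = AIC(step_model)))
```

Setting trace = 1 prints the AIC of every candidate model at each step, which helps to see why a variable was dropped.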

Lasso model

Code
lambda <- 10^seq(-3, 3, length = 100)

lassoFit <- 
  train(
    SalePrice ~ ., 
    data       = train, 
    method     = "glmnet", 
    trControl  = ctrl, 
    preProcess = c("center", "scale"),
    tuneGrid   = expand.grid(alpha = 1, lambda = lambda))
Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
10, : These variables have zero variances: `MS Zoning`I (all), UtilitiesNoSewr,
NeighborhoodLandmrk, `Condition 2`RRNn, `Roof Matl`Membran, `Roof Matl`Metal,
`Exterior 1st`ImStucc, `Exterior 1st`PreCast, `Exterior 2nd`Other, `Exterior
2nd`PreCast, ElectricalMix, FunctionalSal, FunctionalSev, `Misc Feature`TenC
(The same warning is repeated for the remaining resamples, each time listing a slightly different set of zero-variance dummy columns.)
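
These warnings most likely arise because the formula interface expands each factor into dummy columns, and a level that is rare in the data can be missing entirely from the training part of a resample, which leaves its dummy column constant. The sketch below reproduces this in base R with a hypothetical six-row data frame; the data and the chosen split are assumptions for illustration only.

```r
# Toy data (hypothetical): factor level "c" occurs only once
toy <- data.frame(
  y = c(10, 12, 11, 15, 14, 13),
  f = factor(c("a", "b", "a", "b", "a", "c"))
)

# Dummy-encode the factor the way a formula interface does, dropping the intercept
X <- model.matrix(y ~ ., data = toy)[, -1, drop = FALSE]

# Training part of a resample that happens to leave out the only "c" row
X_train <- X[1:5, , drop = FALSE]

# Columns that hold a single value in this training part
is_constant <- apply(X_train, 2, function(col) length(unique(col)) == 1)
zero_var    <- colnames(X_train)[is_constant]
print(zero_var)
```

In caret, adding `"zv"` to the `preProcess` vector should drop such columns within each resample before centering and scaling.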
Code
lassoFit               # to obtain summary of the model
glmnet 

2054 samples
  81 predictor

Pre-processing: centered (279), scaled (279) 
Resampling: Cross-Validated (5 fold, repeated 3 times) 
Summary of sample sizes: 1644, 1643, 1644, 1642, 1643, 1643, ... 
Resampling results across tuning parameters:

  lambda          RMSE      Rsquared   MAE     
     0.001000000  41936.43  0.7607009  17194.09
     0.001149757  41936.43  0.7607009  17194.09
     0.001321941  41936.43  0.7607009  17194.09
     0.001519911  41936.43  0.7607009  17194.09
     0.001747528  41936.43  0.7607009  17194.09
     0.002009233  41936.43  0.7607009  17194.09
     0.002310130  41936.43  0.7607009  17194.09
     0.002656088  41936.43  0.7607009  17194.09
     0.003053856  41936.43  0.7607009  17194.09
     0.003511192  41936.43  0.7607009  17194.09
     0.004037017  41936.43  0.7607009  17194.09
     0.004641589  41936.43  0.7607009  17194.09
     0.005336699  41936.43  0.7607009  17194.09
     0.006135907  41936.43  0.7607009  17194.09
     0.007054802  41936.43  0.7607009  17194.09
     0.008111308  41936.43  0.7607009  17194.09
     0.009326033  41936.43  0.7607009  17194.09
     0.010722672  41936.43  0.7607009  17194.09
     0.012328467  41936.43  0.7607009  17194.09
     0.014174742  41936.43  0.7607009  17194.09
     0.016297508  41936.43  0.7607009  17194.09
     0.018738174  41936.43  0.7607009  17194.09
     0.021544347  41936.43  0.7607009  17194.09
     0.024770764  41936.43  0.7607009  17194.09
     0.028480359  41936.43  0.7607009  17194.09
     0.032745492  41936.43  0.7607009  17194.09
     0.037649358  41936.43  0.7607009  17194.09
     0.043287613  41936.43  0.7607009  17194.09
     0.049770236  41936.43  0.7607009  17194.09
     0.057223677  41936.43  0.7607009  17194.09
     0.065793322  41936.43  0.7607009  17194.09
     0.075646333  41936.43  0.7607009  17194.09
     0.086974900  41936.43  0.7607009  17194.09
     0.100000000  41936.43  0.7607009  17194.09
     0.114975700  41936.43  0.7607009  17194.09
     0.132194115  41936.43  0.7607009  17194.09
     0.151991108  41936.43  0.7607009  17194.09
     0.174752840  41936.43  0.7607009  17194.09
     0.200923300  41936.43  0.7607009  17194.09
     0.231012970  41936.43  0.7607009  17194.09
     0.265608778  41936.43  0.7607009  17194.09
     0.305385551  41936.43  0.7607009  17194.09
     0.351119173  41936.43  0.7607009  17194.09
     0.403701726  41936.43  0.7607009  17194.09
     0.464158883  41936.43  0.7607009  17194.09
     0.533669923  41936.43  0.7607009  17194.09
     0.613590727  41936.43  0.7607009  17194.09
     0.705480231  41936.43  0.7607009  17194.09
     0.811130831  41936.43  0.7607009  17194.09
     0.932603347  41936.43  0.7607009  17194.09
     1.072267222  41936.43  0.7607009  17194.09
     1.232846739  41936.43  0.7607009  17194.09
     1.417474163  41936.43  0.7607009  17194.09
     1.629750835  41936.43  0.7607009  17194.09
     1.873817423  41936.43  0.7607009  17194.09
     2.154434690  41936.43  0.7607009  17194.09
     2.477076356  41936.43  0.7607009  17194.09
     2.848035868  41936.43  0.7607009  17194.09
     3.274549163  41936.43  0.7607009  17194.09
     3.764935807  41936.43  0.7607009  17194.09
     4.328761281  41936.43  0.7607009  17194.09
     4.977023564  41936.43  0.7607009  17194.09
     5.722367659  41936.43  0.7607009  17194.09
     6.579332247  41931.42  0.7607403  17193.22
     7.564633276  41875.71  0.7611846  17186.72
     8.697490026  41811.16  0.7617076  17176.53
    10.000000000  41729.73  0.7623733  17162.75
    11.497569954  41638.96  0.7631148  17147.19
    13.219411485  41535.06  0.7639686  17129.60
    15.199110830  41424.41  0.7648811  17110.27
    17.475284000  41299.13  0.7659093  17089.37
    20.092330026  41136.09  0.7672316  17063.13
    23.101297001  40927.51  0.7689158  17033.53
    26.560877829  40710.69  0.7706659  17001.18
    30.538555088  40477.17  0.7725514  16966.42
    35.111917342  40221.78  0.7746299  16926.88
    40.370172586  39932.47  0.7769898  16882.78
    46.415888336  39627.35  0.7794709  16838.41
    53.366992312  39308.89  0.7820395  16794.80
    61.359072734  38968.04  0.7847822  16745.74
    70.548023107  38584.07  0.7878782  16690.45
    81.113083079  38145.18  0.7914312  16630.57
    93.260334688  37765.64  0.7944114  16575.12
   107.226722201  37335.83  0.7978204  16521.39
   123.284673944  36813.12  0.8020534  16460.89
   141.747416293  36193.39  0.8071349  16390.83
   162.975083462  35512.57  0.8127587  16321.89
   187.381742286  34777.46  0.8188310  16257.48
   215.443469003  33997.07  0.8252792  16196.30
   247.707635599  33285.87  0.8311407  16156.47
   284.803586844  32564.24  0.8370304  16131.39
   327.454916288  31867.12  0.8426471  16125.18
   376.493580679  31237.21  0.8476845  16132.50
   432.876128108  30809.27  0.8510071  16157.94
   497.702356433  30652.24  0.8520907  16191.88
   572.236765935  30593.71  0.8524009  16236.99
   657.933224658  30562.75  0.8524742  16312.17
   756.463327555  30547.32  0.8524215  16387.81
   869.749002618  30541.04  0.8522912  16459.52
  1000.000000000  30536.35  0.8521574  16545.32

Tuning parameter 'alpha' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 1 and lambda = 1000.
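
The selected lambda of 1000 is the largest value in the grid, and the RMSE is still falling there, so the minimum may lie beyond the grid. A small base R check of this, using a handful of (lambda, RMSE) pairs copied from the table above, could look as follows; the wider grid at the end is a hypothetical choice, not a tuned result.

```r
# A few (lambda, RMSE) pairs copied from the resampling table above
cv <- data.frame(
  lambda = c(0.001, 10, 107.226722201, 497.702356433, 869.749002618, 1000),
  RMSE   = c(41936.43, 41729.73, 37335.83, 30652.24, 30541.04, 30536.35)
)

# Lambda with the smallest cross-validated RMSE
best <- cv$lambda[which.min(cv$RMSE)]

# TRUE when the optimum sits on the upper boundary of the grid
at_upper_edge <- best == max(cv$lambda)
print(at_upper_edge)

# A wider grid one could try next (hypothetical range)
lambda_wide <- 10^seq(1, 5, length.out = 100)
print(range(lambda_wide))
```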
Code
varImp(lassoFit)       # to see most important parameters
glmnet variable importance

  only 20 most important variables shown (out of 279)

                    Overall
`Gr Liv Area`        100.00
`Overall Qual`        64.67
`Misc Feature`Elev    35.36
`Bsmt Qual`Ex         34.89
`Condition 2`PosN     27.94
NeighborhoodNridgHt   27.35
`MS SubClass`         25.70
`Bsmt Exposure`Gd     20.69
NeighborhoodStoneBr   20.39
NeighborhoodNoRidge   19.84
`Sale Type`New        18.82
`Pool QC`Gd           18.47
`Year Built`          18.28
`Mas Vnr Area`        17.13
`BsmtFin SF 1`        16.79
`Total Bsmt SF`       16.08
`Garage Cars`         15.42
`Overall Cond`        14.57
`Lot Area`            13.28
Fireplaces            12.61
Code
plot(varImp(lassoFit)) # to plot most important parameters

kNN model

Code
## Run kNN
knnFit <- 
  train(
    SalePrice ~ ., 
    data       = train, 
    method     = "knn", 
    trControl  = ctrl, 
    preProcess = c("center", "scale")
  )
Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
10, : These variables have zero variances: `MS Zoning`I (all), UtilitiesNoSewr,
NeighborhoodLandmrk, `Roof Matl`Membran, `Exterior 1st`ImStucc, `Exterior
1st`PreCast, `Exterior 2nd`Other, `Exterior 2nd`PreCast, `Bsmt Cond`Ex,
HeatingOthW, FunctionalSal, `Misc Feature`TenC

(The same warning is repeated for the remaining resamples, each time listing a slightly different set of zero-variance dummy columns.)
Code
knnFit               # to obtain summary of the model
k-Nearest Neighbors 

2054 samples
  81 predictor

Pre-processing: centered (279), scaled (279) 
Resampling: Cross-Validated (5 fold, repeated 3 times) 
Summary of sample sizes: 1643, 1643, 1644, 1643, 1643, 1643, ... 
Resampling results across tuning parameters:

  k  RMSE      Rsquared   MAE     
  5  38780.89  0.7663277  25177.59
  7  37898.73  0.7792425  24601.56
  9  37811.90  0.7817787  24396.81

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 9.
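
To make the model above more concrete, here is a minimal base R sketch of k-nearest-neighbour regression with centering and scaling. It is not caret's implementation; the helper `knn_predict()`, the `mtcars` data, and the query point are assumptions for illustration.

```r
# Predict one new observation as the mean target of its k nearest training rows
knn_predict <- function(X_train, y_train, x_new, k) {
  # Standardise with the training mean and sd, as "center" and "scale" do
  mu    <- colMeans(X_train)
  sigma <- apply(X_train, 2, sd)
  Z     <- scale(X_train, center = mu, scale = sigma)
  z_new <- (x_new - mu) / sigma

  # Euclidean distance from the new point to every training row
  d <- sqrt(rowSums(sweep(Z, 2, z_new)^2))

  # Average the target over the k closest rows
  mean(y_train[order(d)[1:k]])
}

# Toy data: predict fuel economy from weight and horsepower
X <- as.matrix(mtcars[, c("wt", "hp")])
y <- mtcars$mpg

print(knn_predict(X, y, x_new = c(wt = 3.0, hp = 120), k = 9))
```

Without the standardisation step, the predictor with the largest numeric range would dominate the distance, which is why centering and scaling matter for kNN.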
Code
plot(knnFit)

Code
varImp(knnFit)       # to see most important parameters
loess r-squared variable importance

  only 20 most important variables shown (out of 81)

               Overall
Overall Qual    100.00
Neighborhood     80.72
Gr Liv Area      80.08
Total Bsmt SF    74.98
Garage Area      70.58
1st Flr SF       68.65
Garage Cars      66.43
Exter Qual       66.17
Kitchen Qual     56.39
Year Built       50.35
Full Bath        47.58
Year Remod/Add   44.92
BsmtFin SF 1     42.03
Garage Yr Blt    41.42
Mas Vnr Area     41.04
TotRms AbvGrd    39.77
Bsmt Qual        35.16
Fireplaces       34.42
2nd Flr SF       31.70
PID              30.44
Code
plot(varImp(knnFit)) # to plot most important parameters

Exercise 6 - Assess which model performs best

Evaluation

Models should be compared on the Root Mean Squared Error (RMSE) between the logarithm of the predicted sale price and the logarithm of the observed sale price, also called the Root Mean Squared Log Error (RMSLE). Sale prices are right-skewed, as a histogram of the sale price shows, so taking logarithms keeps errors on expensive houses from dominating errors on cheap ones. The code in this section computes the plain RMSE on the untransformed prices.
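
A minimal sketch of this metric in base R follows. The helper name `rmsle()` and the example prices are assumptions for illustration; note that some packages define RMSLE with log(1 + x) instead of log(x), which makes little difference at house-price magnitudes.

```r
# Root mean squared error between log-predictions and log-observations
rmsle <- function(actual, predicted) {
  sqrt(mean((log(predicted) - log(actual))^2))
}

# Hypothetical observed and predicted sale prices
actual    <- c(100000, 200000, 400000)
predicted <- c(110000, 180000, 400000)
print(rmsle(actual, predicted))

# A 10% overestimate costs the same on a cheap and an expensive house
cheap     <- rmsle(actual = 100000,  predicted = 110000)
expensive <- rmsle(actual = 1000000, predicted = 1100000)
print(all.equal(cheap, expensive))
```

With the fitted models of this document, the same function could be applied as `rmsle(test$SalePrice, predict(lassoFit, newdata = test))`, provided all predictions are positive.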

Code
# Make predictions on the test data
predictions <- predict(model, newdata = test)

# Calculate evaluation metrics (e.g., RMSE)
rmse <- caret::RMSE(predictions, test$SalePrice)
Code
# LASSO
pred_lassoFit <- 
  predict(lassoFit, newdata = test)

lasso_rmse <- 
  rmse(
    actual    = test$SalePrice,
    predicted = pred_lassoFit
  ) %>% 
  round(3)

# KNN
pred_knn <- 
  predict(knnFit, newdata = test)

knn_rmse <- 
  rmse(
    actual    = test$SalePrice,
    predicted = pred_knn
  ) %>%
  round(3)

data.table(
  Model = c("Lasso"   , "KNN"),
  RMSE  = c(lasso_rmse, knn_rmse)
) %T>%
  setorder(RMSE)                  %>%
  .[, .(Rank= 1:.N, Model, RMSE)] %>%
  kbl(
    caption = "Model performance",
    align = 'l', 
    centering = FALSE
  ) %>%
  kable_styling(
    full_width      = FALSE, 
    position        = "left",
    htmltable_class = "lighttable-hover lighttable-condensed lightable-striped"
  )

Appendix
Code
data_description <- 
  paste0(github_ames, "data_description.txt") %>%
  readLines()

cat(data_description, sep = "\n")  # print the data description